Skip to content

feat(gpu-crud): record gpu_mode and MIG strategy on every node-pool op#1233

Open
xuexu6666 wants to merge 9 commits into
mainfrom
xuxue/gpu-mode-mig-metadata
Open

feat(gpu-crud): record gpu_mode and MIG strategy on every node-pool op#1233
xuexu6666 wants to merge 9 commits into
mainfrom
xuxue/gpu-mode-mig-metadata

Conversation

@xuexu6666

@xuexu6666 xuexu6666 commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

Problem

GPU CRUD benchmark records could not reliably distinguish managed vs fully-managed GPU pools. The stable azure-mgmt-containerservice SDK does not model gpuProfile.nvidia.managementMode, so a fully-managed pool's mode is silently dropped from the nodepool_info read-back — leaving only the pool name (a brittle string) to tell them apart. On top of that:

  • MIG single vs mixed was dropped on scale operations.
  • scale_node_pool never recorded the managed flag at all.

What this does

Derive a normalized set of GPU metadata from the operation input flags (not the lossy AKS read-back) and attach it to create, scale, and progressive-scale operations:

field values
gpu_mode none | managed | fully_managed
enable_managed_gpu normalized fully-managed flag
mig_enabled bool
gpu_instance_profile e.g. MIG1g | null
gpu_mig_strategy single | mixed | null
  • Thread gpu_mig_strategy through the scale path (main.pyNodePoolCRUDAKSClient_progressive_scale) and through execute.yml so scale-up/scale-down records are accurate.
  • Echo the metadata to the console on each op (GPU pool metadata: gpu_mode=… mig_enabled=… …) so it is visible in pipeline logs, not only in the uploaded results.json.

Robustness / review hardening

  • gpu_instance_profile / gpu_mig_strategy are Azure-only MIG inputs; the AWS CRUD does not accept them (no **kwargs). They are now only forwarded when --cloud azure, avoiding a TypeError on AWS runs.
  • _gpu_mode_metadata normalizes flag combinations so records are always internally consistent: managed/MIG only apply to a GPU pool; MIG only applies to fully-managed pools (dropped otherwise); invalid gpu_mig_strategy is rejected.
  • execute.yml gates --gpu-instance-profile / --gpu-mig-strategy on ENABLE_MANAGED_GPU==true (create, scale-up, scale-down) so MIG flags are only emitted for fully-managed pools.

Incidental fixes (folded in)

  • Scale poller timeout 20m → 30m (_begin_update_with_retry): A100 MIG node provisioning could exceed the old 1200s timeout, causing spurious scale failures.
  • verify_nvidia_smi_on_node MIG-single race: the readiness wait broke as soon as the nvidia.com/gpu key appeared, even with value 0. A freshly-scaled MIG-single node registers nvidia.com/gpu=0 before publishing its MIG instances, so the check raced in, read 0, and skipped the node with a misleading "has no GPUs" warning. Now waits for a positive GPU/MIG count.

Verification

Validated end-to-end on the GPU Cluster CRUD pipeline — latest run build 71637 (commit e07404fa, succeeded), A100 chain managed → fully_managed → MIG-mixed → MIG-single.

Recorded metadata in the published results.json is correct on every create/scale_up/scale_down op (confirmed across builds 71550/71623/71637):

stage gpu_mode enable_managed_gpu mig_enabled gpu_instance_profile gpu_mig_strategy
managed managed False False None None
fully_managed fully_managed True False None None
MIG mixed fully_managed True True MIG1g mixed
MIG single fully_managed True True MIG1g single

Also confirmed in build 71637 logs:

  • Console echo prints per op, e.g. GPU pool metadata: gpu_mode=managed enable_managed_gpu=False mig_enabled=False gpu_instance_profile=None gpu_mig_strategy=None.
  • The MIG-single nvidia.com/gpu=0 race is fixed — a100migsin is no longer skipped.
  • No scale-poller timeout (30-min bump applied).

managed vs fully_managed are now distinguishable, and MIG mixed vs single is recorded on scale ops as well as create.

Known / out-of-scope (not introduced by this PR)

Observed in build 71637 but unrelated to these changes:

  • A100 vCPU quota exhaustion in southcentralus (ErrCode_InsufficientVCPUQuota, requested 192 / remaining 0) during an a100migmixed scale-up — environmental capacity, not a code issue.
  • Managed-pool GPU advertisement: a100managed (driver-install) nodes did not advertise nvidia.com/gpu within the wait window, so nvidia-smi verification skipped them. This is pre-existing behavior (the prior code also waited and skipped, since the key never appears) — a device-plugin scheduling/timing issue on the managed A100 path, worth a separate follow-up.

Testing

  • Unit tests for the metadata helper variants, normalization (MIG dropped for non-fully-managed, managed flag normalized without a GPU pool, invalid strategy rejected), scale-op metadata persistence, console echo, an AWS regression test asserting MIG kwargs are omitted for --cloud aws, and a verify_nvidia_smi_on_node regression test (node reports 0 then a positive count → verified, not skipped).
  • End-to-end validated via build 71637 (tables above).

GPU CRUD benchmark records could not reliably distinguish managed vs
fully-managed GPU pools: the stable azure-mgmt-containerservice SDK does
not model gpuProfile.nvidia.managementMode, so a fully-managed pool's mode
is silently dropped from the nodepool_info read-back, leaving only the pool
name (a brittle string) to tell them apart. MIG single vs mixed was also
dropped on scale ops, and scale_node_pool never even recorded the managed
flag.

Derive a normalized set of GPU metadata from the operation INPUT flags and
attach it to create, scale, and progressive-scale operations:
  - gpu_mode: "none" | "managed" | "fully_managed"
  - enable_managed_gpu, mig_enabled
  - gpu_instance_profile, gpu_mig_strategy ("single" | "mixed")

Thread gpu_mig_strategy through the scale path (main.py -> NodePoolCRUD ->
AKSClient -> _progressive_scale), and pass the scale-path GPU flags through
execute.yml so scale-up/scale-down records are accurate in the pipeline.

Add unit tests for the metadata helper variants and for scale ops persisting
gpu_mode + MIG fields.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR improves the fidelity of GPU CRUD benchmark records by deriving and persisting normalized GPU metadata (managed vs fully-managed mode and MIG configuration) from operation input flags, rather than relying on AKS SDK read-back fields that may be missing.

Changes:

  • Add a normalized GPU metadata helper (gpu_mode, MIG fields) and attach it to create/scale/progressive-scale operation metadata.
  • Thread gpu_mig_strategy through the Azure scale call chain (main.py → NodePoolCRUD → AKSClient → _progressive_scale).
  • Update the pipeline execution step to pass MIG strategy on scale operations, and add unit tests covering metadata normalization and scale metadata persistence.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
steps/engine/crud/k8s/execute.yml Passes additional GPU/MIG CLI flags into create/scale steps so pipeline records can include them.
modules/python/tests/clients/test_aks_client.py Adds unit tests for GPU metadata normalization and scale operation metadata persistence.
modules/python/crud/main.py Threads gpu_mig_strategy into scale/all kwargs so the CLI can pass it into CRUD implementations.
modules/python/crud/azure/node_pool_crud.py Extends Azure NodePoolCRUD signatures/forwarding to include gpu_mig_strategy.
modules/python/clients/aks_client.py Introduces _gpu_mode_metadata and attaches it to operation metadata across create/scale/progressive-scale.

Comment thread modules/python/crud/main.py
Comment thread modules/python/crud/main.py Outdated
Comment thread steps/engine/crud/k8s/execute.yml Outdated
Comment thread steps/engine/crud/k8s/execute.yml Outdated
Comment thread modules/python/clients/aks_client.py Outdated
Existing exact-call assertions in test_main.py and test_azure_node_pool_crud.py
broke because scale_node_pool / all now receive gpu_mig_strategy. Add the new
kwarg to the expected calls.
The helper is intentionally tested directly; disable W0212 for that test method.
…-access

The def-line block disable did not suppress W0212 in CI; bind the protected
helper to a single inline-disabled local and call that instead.
…t GPU metadata

- main.py: gpu_instance_profile/gpu_mig_strategy are Azure-only MIG inputs; the
  AWS CRUD does not accept them (no **kwargs) and would raise TypeError. Only
  forward them when --cloud azure, via a conditional azure_gpu_kwargs dict.
- aks_client._gpu_mode_metadata: normalize flag combinations so records are
  always internally consistent — enable_managed_gpu/MIG only apply to a GPU
  pool; MIG only applies to fully-managed pools (dropped otherwise); reject
  invalid gpu_mig_strategy values.
- execute.yml: gate --gpu-instance-profile/--gpu-mig-strategy on
  ENABLE_MANAGED_GPU==true (create, scale-up, scale-down) so MIG flags are only
  emitted for fully-managed pools.
- tests: set cloud="azure" on the GPU-kwarg cases, add an AWS regression test
  asserting MIG kwargs are omitted, and cover the new metadata normalization.
Log gpu_mode/enable_managed_gpu/mig_enabled/gpu_instance_profile/gpu_mig_strategy
on each create/scale/progressive-scale op so the GPU mode is visible directly in
pipeline logs (not only in the uploaded results.json). Skipped for non-GPU pools.
…s (C0302)

The console-echo helper pushed the module to 1014 lines (limit 1000). Condense
the _gpu_mode_metadata docstring and the gpu_mig_strategy ValueError message.
A100 MIG-single node provisioning can exceed the previous 1200s (20 min) scale
poller timeout, causing spurious scale failures. Bump _begin_update_with_retry
default timeout to 1800s (30 min).
…n nvidia-smi verify

verify_nvidia_smi_on_node broke its readiness wait as soon as the nvidia.com/gpu
key appeared, even with value 0. On a freshly-scaled MIG-single node the device
plugin registers nvidia.com/gpu=0 before publishing the MIG instances, so the
check raced in, read 0, and skipped the node with a misleading 'has no GPUs'
warning. Wait for a positive GPU or MIG slice count instead of mere key presence.

Tests: update no-GPU test for the new wait semantics; add a regression test where
a node reports 0 then a positive count and must be verified (not skipped).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants